class: center, middle, inverse, title-slide # Databases ### Kirill & Nicolas ### cynkra GmbH ### March 15, 2022 --- <style type="text/css"> .pull-left { margin-top: -25px; } .pull-right { margin-top: -25px; } .remark-code { font-size: 17px; } .font17 { font-size: 17px; } .font14 { font-size: 14px; } </style> # Introduction Organization of half-day R courses: - Intro courses: * Tidyverse intro I * Base R intro/Tidyverse intro II * Data visualization I * Data visualization II - Advanced courses: * Advanced tidyverse * R package creation * Working with database systems * Parallelization & efficient R programming * Databases (this course) --- # Course material Our course material currently is available from GitHub at https://github.com/cynkra/bag-courses Today we will be looking at the folder `5_databases` --- # Download course material  --- # General remarks - We hope for these courses to be interactive: go ahead and ask if something is unclear! - You can also write into the chat, which I will try to monitor when Kirill is presenting. - We were asked to provide recordings of the courses for those of you who cannot join, so recording is activated. - Per course unit, we offer 4 hours of follow up time; approach us with questions (nicolas@cynkra.com)! --- # Goals for today Playing the whole game! - Extract - Transform - Load - **Consume** --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Subtle issues to watch out for - Basic ETL for one table ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - **Read whole tables** - Let the database do the heavy lifting - Subtle issues to watch out for - Basic ETL for one table ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/11.webp") background-size: contain background-position: 100% 100% # Read whole tables .pull-left[ - {DBI} package - Connect - Discover - Read - Query Script: `databases_11.R` ```r library(tidyverse) library(DBI) ``` ] --- background-image: url("data:image/png;base64,#images/11-frame.webp") background-size: contain background-position: 100% 100% # Read whole tables .pull-left[ - {DBI} package - Connect - Discover - Read - Query Script: `databases_11.R` ```r library(tidyverse) library(DBI) ``` ] --- # Connect to the database First step when accessing the database. ```r con_duckdb <- dbConnect(duckdb::duckdb()) con_duckdb ``` ``` ## <duckdb_connection 0faa0 driver=<duckdb_driver 22170 dbdir=':memory:' read_only=FALSE>> ``` --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Prepare database Normally done in a preparatory step. ```r dm::copy_dm_to( con_duckdb, dm::dm_pixarfilms(), set_key_constraints = FALSE, temporary = FALSE ) ``` --- # Discover tables Where is my data? ```r dbListTables(con_duckdb) ``` ``` ## [1] "academy" "box_office" "genres" "pixar_films" "pixar_people" "public_response" ``` ```r dbListFields(con_duckdb, "pixar_films") ``` ``` ## [1] "number" "film" "release_date" "run_time" "film_rating" ``` Caveat: schemas, catalogs, ... ```sql SELECT * FROM INFORMATION_SCHEMA.TABLES ``` --- # Read tables Read entire tables into your local session, if you can afford it. ```r df_pixar_films <- dbReadTable(con_duckdb, "pixar_films") df_pixar_films ``` ``` ## number film release_date run_time film_rating ## 1 1 Toy Story 1995-11-22 81 G ## 2 2 A Bug's Life 1998-11-25 95 G ## 3 3 Toy Story 2 1999-11-24 92 G ## 4 4 Monsters, Inc. 2001-11-02 92 G ## 5 5 Finding Nemo 2003-05-30 100 G ## 6 6 The Incredibles 2004-11-05 115 PG ## 7 7 Cars 2006-06-09 117 G ## 8 8 Ratatouille 2007-06-29 111 G ## 9 9 WALL-E 2008-06-27 98 G ## 10 10 Up 2009-05-29 96 PG ## 11 11 Toy Story 3 2010-06-18 103 G ## 12 12 Cars 2 2011-06-24 106 G ## 13 13 Brave 2012-06-22 93 PG ## 14 14 Monsters University 2013-06-21 104 G ## 15 15 Inside Out 2015-06-19 95 PG ## 16 16 The Good Dinosaur 2015-11-25 93 PG ## 17 17 Finding Dory 2016-06-17 97 PG ## 18 18 Cars 3 2017-06-16 102 G ## 19 19 Coco 2017-11-22 105 PG ## 20 20 Incredibles 2 2018-06-15 118 PG ## [ reached 'max' / getOption("max.print") -- omitted 7 rows ] ``` --- # Read tables Use `as_tibble()` to convert to a tibble for better display and more robust operation. ```r df_pixar_films <- dbReadTable(con_duckdb, "pixar_films") as_tibble(df_pixar_films) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 27 × 5</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG ## <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Execute queries Write SQL code to define what data you want to see. ```r dbGetQuery(con_duckdb, "SELECT * FROM pixar_films") ``` ``` ## number film release_date run_time film_rating ## 1 1 Toy Story 1995-11-22 81 G ## 2 2 A Bug's Life 1998-11-25 95 G ## 3 3 Toy Story 2 1999-11-24 92 G ## 4 4 Monsters, Inc. 2001-11-02 92 G ## 5 5 Finding Nemo 2003-05-30 100 G ## 6 6 The Incredibles 2004-11-05 115 PG ## 7 7 Cars 2006-06-09 117 G ## 8 8 Ratatouille 2007-06-29 111 G ## 9 9 WALL-E 2008-06-27 98 G ## 10 10 Up 2009-05-29 96 PG ## 11 11 Toy Story 3 2010-06-18 103 G ## 12 12 Cars 2 2011-06-24 106 G ## 13 13 Brave 2012-06-22 93 PG ## 14 14 Monsters University 2013-06-21 104 G ## 15 15 Inside Out 2015-06-19 95 PG ## 16 16 The Good Dinosaur 2015-11-25 93 PG ## 17 17 Finding Dory 2016-06-17 97 PG ## 18 18 Cars 3 2017-06-16 102 G ## 19 19 Coco 2017-11-22 105 PG ## 20 20 Incredibles 2 2018-06-15 118 PG ## [ reached 'max' / getOption("max.print") -- omitted 7 rows ] ``` --- # Execute queries Write complex SQL code to define what data you want to see. ```r sql <- " SELECT * FROM pixar_films WHERE release_date >= '2020-01-01' " ``` ```r dbGetQuery(con_duckdb, sql) ``` ``` ## number film release_date run_time film_rating ## 1 22 Onward 2020-03-06 102 PG ## 2 23 Soul 2020-12-25 100 PG ## 3 24 Luca 2021-06-18 151 N/A ## 4 25 Turning Red 2022-03-11 NA N/A ## 5 26 Lightyear 2022-06-17 NA N/A ## 6 27 <NA> 2023-06-16 155 Not Rated ``` --- # Execute queries R 4.0 or later: use new-style string literals for mixing quotes. ```r sql <- r"( SELECT * FROM "pixar_films" WHERE "release_date" >= '2020-01-01' )" ``` ```r dbGetQuery(con_duckdb, sql) ``` ``` ## number film release_date run_time film_rating ## 1 22 Onward 2020-03-06 102 PG ## 2 23 Soul 2020-12-25 100 PG ## 3 24 Luca 2021-06-18 151 N/A ## 4 25 Turning Red 2022-03-11 NA N/A ## 5 26 Lightyear 2022-06-17 NA N/A ## 6 27 <NA> 2023-06-16 155 Not Rated ``` --- background-image: url("data:image/png;base64,#images/magic-hat-and-wand-nath-r.svg") background-size: 25% background-position: 100% 100% # Further pointers .pull-left[ ## Quoting ```r dbQuoteIdentifier(con_duckdb, "academy") ``` ``` ## <SQL> "academy" ``` ```r dbQuoteLiteral(con_duckdb, "Toy Story") ``` ``` ## <SQL> 'Toy Story' ``` ```r dbQuoteLiteral(con_duckdb, as.Date("2020-01-01")) ``` ``` ## <SQL> '2020-01-01'::date ``` ```r glue::glue_sql(...) ``` ] .pull-right[ ## Parameterized queries ```r sql <- " SELECT count(*) FROM pixar_films WHERE release_date >= ? " dbGetQuery(con_duckdb, sql, params = list(as.Date("2020-01-01")) ) ``` ``` ## count_star() ## 1 6 ``` ] --- # Read whole tables: Exercises 1. List all columns from the `box_office` table. 2. Read the `academy` table. 3. Read all records from the `academy` table that correspond to awards won - Hint: Use the query `"SELECT * FROM academy WHERE status = 'Won'"` 4. Use quoting and/or query parameters to stabilize the previous query. --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - **Let the database do the heavy lifting** - Subtle issues to watch out for - Basic ETL for one table ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- # Downsizing on the database - {dbplyr} package (part of the tidyverse) - Lazy tables - `collect()` - `select()` - `filter()` - `group_by()`, `summarize()`, `ungroup()`, `count()` Script: `databases_12_1.R` ```r library(tidyverse) ``` --- # Lazy tables A pointer to a SQL table. The data is still on the database! ```r pixar_films <- tbl(con_duckdb, "pixar_films") pixar_films ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># Source: table<pixar_films> [?? x 5]</span> ## <span style='color: #949494;'># Database: duckdb_connection</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG ## <span style='color: #949494;'># … with more rows</span> </CODE></PRE> --- # Read the whole table ```r df_pixar_films <- pixar_films %>% collect() df_pixar_films ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 27 × 5</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG ## <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> --- # Select columns With `select()`, like with data frames. .pull-left[ ```r pixar_films %>% select(1:3) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># Source: lazy query [?? x 3]</span> ## <span style='color: #949494;'># Database: duckdb_connection</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 ## <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% select(1:3) %>% show_query() ``` ``` ## <SQL> ## SELECT "number", "film", "release_date" ## FROM "pixar_films" ``` ] --- # Select columns and read .pull-left[ ```r df_pixar_films_3 <- pixar_films %>% select(1:3) %>% collect() df_pixar_films_3 ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 27 × 3</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 ## <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 27 × 5</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG ## <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Select rows With `filter()`, like with data frames. .pull-left[ ```r pixar_films %>% filter(release_date >= "2020-01-01") ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># Source: lazy query [?? x 5]</span> ## <span style='color: #949494;'># Database: duckdb_connection</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'>1</span> 22 Onward 2020-03-06 102 PG ## <span style='color: #BCBCBC;'>2</span> 23 Soul 2020-12-25 100 PG ## <span style='color: #BCBCBC;'>3</span> 24 Luca 2021-06-18 151 N/A ## <span style='color: #BCBCBC;'>4</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A ## <span style='color: #BCBCBC;'>5</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A ## <span style='color: #BCBCBC;'>6</span> 27 <span style='color: #BB0000;'>NA</span> 2023-06-16 155 Not Rated </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% filter(release_date >= "2020-01-01") %>% show_query() ``` ``` ## <SQL> ## SELECT * ## FROM "pixar_films" ## WHERE ("release_date" >= '2020-01-01') ``` ] --- # Select rows and read .pull-left[ ```r df_pixar_films_202x <- pixar_films %>% filter(release_date >= "2020-01-01") %>% collect() df_pixar_films_202x ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 6 × 5</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'>1</span> 22 Onward 2020-03-06 102 PG ## <span style='color: #BCBCBC;'>2</span> 23 Soul 2020-12-25 100 PG ## <span style='color: #BCBCBC;'>3</span> 24 Luca 2021-06-18 151 N/A ## <span style='color: #BCBCBC;'>4</span> 25 Turning Red 2022-03-11 <span style='color: #BB0000;'>NA</span> N/A ## <span style='color: #BCBCBC;'>5</span> 26 Lightyear 2022-06-17 <span style='color: #BB0000;'>NA</span> N/A ## <span style='color: #BCBCBC;'>6</span> 27 <span style='color: #BB0000;'>NA</span> 2023-06-16 155 Not Rated </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 27 × 5</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG ## <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Aggregate With `group_by()` + `summarize()` + `ungroup()`, like with data frames. .pull-left[ ```r pixar_films %>% group_by(film_rating) %>% summarize(n = n()) %>% ungroup() ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># Source: lazy query [?? x 2]</span> ## <span style='color: #949494;'># Database: duckdb_connection</span> ## <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> ## <span style='color: #BCBCBC;'>1</span> G 13 ## <span style='color: #BCBCBC;'>2</span> PG 10 ## <span style='color: #BCBCBC;'>3</span> N/A 3 ## <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% group_by(film_rating) %>% summarize(n = n()) %>% ungroup() %>% show_query() ``` ``` ## <SQL> ## SELECT "film_rating", COUNT(*) AS "n" ## FROM "pixar_films" ## GROUP BY "film_rating" ``` ] --- # Aggregate With `count()`, like with data frames. .pull-left[ ```r pixar_films %>% # # count(film_rating) ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># Source: lazy query [?? x 2]</span> ## <span style='color: #949494;'># Database: duckdb_connection</span> ## <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> ## <span style='color: #BCBCBC;'>1</span> G 13 ## <span style='color: #BCBCBC;'>2</span> PG 10 ## <span style='color: #BCBCBC;'>3</span> N/A 3 ## <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Under the hood ```r pixar_films %>% # # count(film_rating) %>% show_query() ``` ``` ## <SQL> ## SELECT "film_rating", COUNT(*) AS "n" ## FROM "pixar_films" ## GROUP BY "film_rating" ``` ] --- # Aggregate and read .pull-left[ ```r df_pixar_films_by_rating <- pixar_films %>% count(film_rating) %>% collect() df_pixar_films_by_rating ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 4 × 2</span> ## <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> ## <span style='color: #BCBCBC;'>1</span> G 13 ## <span style='color: #BCBCBC;'>2</span> PG 10 ## <span style='color: #BCBCBC;'>3</span> N/A 3 ## <span style='color: #BCBCBC;'>4</span> Not Rated 1 </CODE></PRE> ] .pull-right[ ## Data on the database not affected ```r pixar_films %>% collect() ``` <PRE class="fansi fansi-output"><CODE>## <span style='color: #949494;'># A tibble: 27 × 5</span> ## <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> ## <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> ## <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G ## <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G ## <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G ## <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G ## <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G ## <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG ## <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G ## <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G ## <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G ## <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG ## <span style='color: #949494;'># … with 17 more rows</span> </CODE></PRE> ] --- # Downsizing on the database: Exercises 1 * Find several ways to select the 3 first columns * What happens if you include the name of a variable multiple times in a `select()` call? * Select all columns that contain underscores (use `contains()`) * Use `all_of()` to select 2 columns of your choice --- # Downsizing on the database: Exercises 2 Find all films that 1. Are rated "PG" 2. Had a run time below 95 3. Had a rating of "N/A" or "Not Rated" 4. Were released after and including year 2020 5. Have a missing name (`film` column) or `run_time` 6. Were released after and including year 2020 with name not missing 7. Are a first sequel (the name ends with "2") - Hint: Bring the data into the R session before filtering --- # Downsizing on the database: Exercises 3 1. How many films are stored in the table? 2. How many films released after 2005 are stored in the table? 3. What is the total run time of all films? - Hint: Use `summarize(sum(...))`, watch out for the warning 4. What is the total run time of all films, per rating? - Hint: Use `group_by()` --- .pull-left[ ] .pull-right[ ]